ABSTRACT

We introduce a probabilistic approach to the LMS filter.
By means of an efficient approximation, this approach provides
an adaptable step-size LMS algorithm together with a
measure of uncertainty about the estimation. In addition, the
proposed approximation preserves the linear complexity of
the standard LMS. Numerical results show the improved performance
of the algorithm with respect to standard LMS and
state-of-the-art algorithms with similar complexity. The goal
of this work, therefore, is to open the door to bringing more
Bayesian machine learning techniques to adaptive filtering.

Index Terms — probabilistic models, least-mean-squares, 
adaptive filtering, state-space models 

1. INTRODUCTION 

Probabilistic models have proven to be very useful in many
signal processing applications where signal estimation
is needed [1, 2, 3]. Some of their advantages are that 1) they
force the designer to specify all the assumptions of the model,
2) they provide a clear separation between the model and the
algorithm used to solve it, and 3) they usually provide some
measure of uncertainty about the estimation.

On the other hand, adaptive filtering is a standard approach
in estimation problems when the input is received
as a stream of data that is potentially non-stationary. This
approach is widely understood and applied to several problems
such as echo cancellation [4], noise cancellation [5], and
channel equalization [6].

Although these two approaches share some underlying relations,
there are very few connections in the literature. The
first important attempt in the signal processing community to
relate these two fields was the connection between a linear
Gaussian state-space model (i.e. Kalman filter) and the RLS
filter, by Sayed and Kailath [7] and then by Haykin et al. [8].
The RLS adaptive filtering algorithm emerges naturally when
one defines a particular state-space model (SSM) and then
performs exact inference in that model. This approach was
later exploited in [9] to design a kernel RLS algorithm based
on Gaussian processes.

A first attempt to approximate the LMS filter from a
probabilistic perspective was presented in [10], focusing on
a kernel-based implementation. The algorithm of [10] makes
use of a Maximum a Posteriori (MAP) estimate as an approximation
for the predictive step. However, this approximation
does not preserve the estimate of the uncertainty in each step,
therefore degrading the performance of the algorithm.

In this work, we provide a similar connection between
state-space models and least-mean-squares (LMS). Our approach
is based on approximating the posterior distribution
with an isotropic Gaussian distribution. We show how the
computation of this approximated posterior leads to a
linear-complexity algorithm, comparable to the standard LMS.
Similar approaches have already been developed for a variety of
problems such as channel equalization using recurrent RBF
neural networks [11], or Bayesian forecasting [12]. Here, we
show the usefulness of this probabilistic approach for adaptive
filtering.

The probabilistic perspective we adopt throughout this
work presents two main advantages. Firstly, a novel LMS
algorithm with adaptable step size emerges naturally with
this approach, making it suitable for both stationary and
non-stationary environments. The proposed algorithm has fewer
free parameters than previous LMS algorithms with variable
step size [13, 14, 15], and its parameters are easier to tune
than those of these algorithms and of standard LMS. Secondly, the use
of a probabilistic model provides us with an estimate of the
error variance, which is useful in many applications.

Experiments with simulated and real data show the advantages
of the presented approach with respect to previous
works. However, we remark that the main contribution of this
paper is that it opens the door to introducing more Bayesian
machine learning techniques, such as variational inference
and Monte Carlo sampling methods [16], to adaptive filtering.



2. PROBABILISTIC MODEL 

Throughout this work, we assume the observation model to
be linear-Gaussian with the following distribution,

$$p(y_k \mid \mathbf{w}_k) = \mathcal{N}(y_k;\, \mathbf{x}_k^T \mathbf{w}_k,\, \sigma_n^2), \tag{1}$$

where $\sigma_n^2$ is the variance of the observation noise, $\mathbf{x}_k$ is the
regression vector and $\mathbf{w}_k$ is the parameter vector to be
sequentially estimated, both $M$-dimensional column vectors.

In a non-stationary scenario, $\mathbf{w}_k$ follows a dynamic process.
In particular, we consider a diffusion process (random-walk
model) with variance $\sigma_d^2$ for this parameter vector:

$$p(\mathbf{w}_k \mid \mathbf{w}_{k-1}) = \mathcal{N}(\mathbf{w}_k;\, \mathbf{w}_{k-1},\, \sigma_d^2 \mathbf{I}), \tag{2}$$

where $\mathbf{I}$ denotes the identity matrix. In order to initiate the
recursion, we assume the following prior distribution on $\mathbf{w}_k$:

$$p(\mathbf{w}_0) = \mathcal{N}(\mathbf{w}_0;\, \mathbf{0},\, \sigma_d^2 \mathbf{I}).$$

3. EXACT INFERENCE IN THIS MODEL:
REVISITING THE RLS FILTER

Given the described probabilistic SSM, we would like to infer
the posterior probability distribution $p(\mathbf{w}_k \mid y_{1:k})$. Since all
involved distributions are Gaussian, one can perform exact
inference, leveraging the probability rules in a straightforward
manner. The resulting probability distribution is

$$p(\mathbf{w}_k \mid y_{1:k}) = \mathcal{N}(\mathbf{w}_k;\, \boldsymbol{\mu}_k,\, \boldsymbol{\Sigma}_k),$$

in which the mean vector $\boldsymbol{\mu}_k$ is given by

$$\boldsymbol{\mu}_k = \boldsymbol{\mu}_{k-1} + \mathbf{K}_k \left(y_k - \mathbf{x}_k^T \boldsymbol{\mu}_{k-1}\right) \mathbf{x}_k,$$

where we have introduced the auxiliary variable

$$\mathbf{K}_k = \frac{\boldsymbol{\Sigma}_{k-1} + \sigma_d^2 \mathbf{I}}{\mathbf{x}_k^T \left(\boldsymbol{\Sigma}_{k-1} + \sigma_d^2 \mathbf{I}\right) \mathbf{x}_k + \sigma_n^2},$$

and the covariance matrix $\boldsymbol{\Sigma}_k$ is obtained as

$$\boldsymbol{\Sigma}_k = \left(\mathbf{I} - \mathbf{K}_k \mathbf{x}_k \mathbf{x}_k^T\right) \left(\boldsymbol{\Sigma}_{k-1} + \sigma_d^2 \mathbf{I}\right).$$

Note that the mode of $p(\mathbf{w}_k \mid y_{1:k})$, i.e. the maximum-a-posteriori
(MAP) estimate, coincides with the RLS adaptive rule

$$\mathbf{w}_k^{(\mathrm{RLS})} = \mathbf{w}_{k-1}^{(\mathrm{RLS})} + \mathbf{K}_k \left(y_k - \mathbf{x}_k^T \mathbf{w}_{k-1}^{(\mathrm{RLS})}\right) \mathbf{x}_k. \tag{3}$$

This rule is similar to the one introduced in [?].

Finally, note that the covariance matrix $\boldsymbol{\Sigma}_k$ is a measure
of the uncertainty of the estimate $\mathbf{w}_k$ conditioned on the
observed data $y_{1:k}$. Nevertheless, for many applications a single
scalar summarizing the variance of the estimate could prove
to be sufficiently useful. In the next section, we show how
such a scalar is obtained naturally when $p(\mathbf{w}_k \mid y_{1:k})$ is
approximated with an isotropic Gaussian distribution. We also show
that this approximation leads to an LMS-like estimation.

4. APPROXIMATING THE POSTERIOR
DISTRIBUTION: LMS FILTER

The proposed approach consists in approximating the posterior
distribution $p(\mathbf{w}_k \mid y_{1:k})$, in general a multivariate Gaussian
distribution with a full covariance matrix, by an isotropic
spherical Gaussian distribution

$$\hat{p}(\mathbf{w}_k \mid y_{1:k}) = \mathcal{N}(\mathbf{w}_k;\, \hat{\boldsymbol{\mu}}_k,\, \hat{\sigma}_k^2 \mathbf{I}). \tag{4}$$

In order to estimate the mean and covariance of the approximate
distribution $\hat{p}(\mathbf{w}_k \mid y_{1:k})$, we propose to select those
that minimize the Kullback-Leibler divergence with respect
to the original distribution, i.e.,

$$\{\hat{\boldsymbol{\mu}}_k, \hat{\sigma}_k\} = \arg\min_{\hat{\boldsymbol{\mu}}_k,\, \hat{\sigma}_k} \left\{ D_{KL}\left( p(\mathbf{w}_k \mid y_{1:k}) \,\|\, \hat{p}(\mathbf{w}_k \mid y_{1:k}) \right) \right\}.$$


The derivation of the corresponding minimization problem
can be found in Appendix A. In particular, the optimal
mean and the covariance are found as

$$\hat{\boldsymbol{\mu}}_k = \boldsymbol{\mu}_k; \qquad \hat{\sigma}_k^2 = \frac{\mathrm{Tr}\{\boldsymbol{\Sigma}_k\}}{M}. \tag{5}$$
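The optimum in (5) can be checked numerically. The following is a minimal sketch, assuming an arbitrary randomly generated full covariance; the helper name `kl_full_to_isotropic` and all numerical values are illustrative and not part of the original work. It compares the closed-form variance Tr{Σ}/M against a brute-force search over candidate isotropic variances.

```python
import numpy as np

def kl_full_to_isotropic(mu1, Sigma1, mu2, s2):
    """KL( N(mu1, Sigma1) || N(mu2, s2 * I) ) for a scalar variance s2 > 0."""
    M = mu1.size
    diff = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(Sigma1)
    return 0.5 * (np.trace(Sigma1) / s2 + diff @ diff / s2 - M
                  + M * np.log(s2) - logdet1)

rng = np.random.default_rng(0)
M = 4
A = rng.standard_normal((M, M))
Sigma = A @ A.T + 0.1 * np.eye(M)   # arbitrary full covariance (illustrative)
mu = rng.standard_normal(M)

s2_star = np.trace(Sigma) / M       # closed-form optimum from (5)

# Brute-force search over candidate variances, keeping the mean at mu.
grid = np.linspace(0.2 * s2_star, 5.0 * s2_star, 2001)
kl_grid = np.array([kl_full_to_isotropic(mu, Sigma, mu, s2) for s2 in grid])
s2_grid = grid[np.argmin(kl_grid)]

kl_star = kl_full_to_isotropic(mu, Sigma, mu, s2_star)
# Moving the mean away from mu can only increase the divergence.
kl_shifted = kl_full_to_isotropic(mu, Sigma, mu + 0.1, s2_star)

print(s2_star, s2_grid, kl_star, kl_shifted)
```

The grid minimizer should agree with Tr{Σ}/M up to the grid resolution, and shifting the mean should increase the divergence, in line with the two results stated in (5).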


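We now show that by using (4) in the recursive predictive
and filtering expressions we obtain an LMS-like
adaptive rule. First, let us assume that we have an approximate
posterior distribution at $k-1$, $\hat{p}(\mathbf{w}_{k-1} \mid y_{1:k-1}) =
\mathcal{N}(\mathbf{w}_{k-1};\, \hat{\boldsymbol{\mu}}_{k-1},\, \hat{\sigma}_{k-1}^2 \mathbf{I})$. Since all involved distributions are
Gaussian, the predictive distribution is obtained as

$$\hat{p}(\mathbf{w}_k \mid y_{1:k-1}) = \int p(\mathbf{w}_k \mid \mathbf{w}_{k-1})\, \hat{p}(\mathbf{w}_{k-1} \mid y_{1:k-1})\, d\mathbf{w}_{k-1} = \mathcal{N}(\mathbf{w}_k;\, \boldsymbol{\mu}_{k|k-1},\, \boldsymbol{\Sigma}_{k|k-1}), \tag{6}$$

where the mean vector and covariance matrix are given by

$$\boldsymbol{\mu}_{k|k-1} = \hat{\boldsymbol{\mu}}_{k-1}, \qquad \boldsymbol{\Sigma}_{k|k-1} = \left(\hat{\sigma}_{k-1}^2 + \sigma_d^2\right) \mathbf{I}.$$

From (6), the posterior distribution at time $k$ can be computed
using Bayes' Theorem and standard Gaussian manipulations
(see for instance [?, Ch. 4]). Then, we approximate
the posterior $p(\mathbf{w}_k \mid y_{1:k})$ with an isotropic Gaussian,

$$\hat{p}(\mathbf{w}_k \mid y_{1:k}) = \mathcal{N}(\mathbf{w}_k;\, \hat{\boldsymbol{\mu}}_k,\, \hat{\sigma}_k^2 \mathbf{I}),$$

where

$$\hat{\boldsymbol{\mu}}_k = \hat{\boldsymbol{\mu}}_{k-1} + \frac{\hat{\sigma}_{k-1}^2 + \sigma_d^2}{\left(\hat{\sigma}_{k-1}^2 + \sigma_d^2\right) \|\mathbf{x}_k\|^2 + \sigma_n^2} \left(y_k - \mathbf{x}_k^T \hat{\boldsymbol{\mu}}_{k-1}\right) \mathbf{x}_k = \hat{\boldsymbol{\mu}}_{k-1} + \eta_k \left(y_k - \mathbf{x}_k^T \hat{\boldsymbol{\mu}}_{k-1}\right) \mathbf{x}_k. \tag{7}$$

Note that, instead of a gain matrix $\mathbf{K}_k$ as in Eq. (3), we now
have a scalar gain $\eta_k$ that operates as a variable step size.

Finally, to obtain the posterior variance, which is our measure
of uncertainty, we apply (5) and the trick $\mathrm{Tr}\{\mathbf{x}_k \mathbf{x}_k^T\} =
\mathbf{x}_k^T \mathbf{x}_k = \|\mathbf{x}_k\|^2$,

\begin{align}
\hat{\sigma}_k^2 &= \frac{\mathrm{Tr}(\boldsymbol{\Sigma}_k)}{M} \tag{8} \\
&= \frac{1}{M} \mathrm{Tr}\left\{ \left(\mathbf{I} - \eta_k \mathbf{x}_k \mathbf{x}_k^T\right) \left(\hat{\sigma}_{k-1}^2 + \sigma_d^2\right) \right\} \tag{9} \\
&= \left(1 - \frac{\eta_k \|\mathbf{x}_k\|^2}{M}\right) \left(\hat{\sigma}_{k-1}^2 + \sigma_d^2\right). \tag{10}
\end{align}

If MAP estimation is performed, we obtain an adaptable
step-size LMS estimation

$$\mathbf{w}_k^{(\mathrm{LMS})} = \mathbf{w}_{k-1}^{(\mathrm{LMS})} + \eta_k \left(y_k - \mathbf{x}_k^T \mathbf{w}_{k-1}^{(\mathrm{LMS})}\right) \mathbf{x}_k, \tag{11}$$

with

$$\eta_k = \frac{\hat{\sigma}_{k-1}^2 + \sigma_d^2}{\left(\hat{\sigma}_{k-1}^2 + \sigma_d^2\right) \|\mathbf{x}_k\|^2 + \sigma_n^2}.$$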
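The recursion formed by (7), (10) and (11) fits in a few lines of code. The sketch below is one possible reading of those equations, not the authors' implementation; the function name `prob_lms_step`, the argument names `sigma2_n` and `sigma2_d`, and the numerical values in the example call are assumptions made for illustration. The last lines cross-check the scalar variance update (10) against the explicit trace in (9).

```python
import numpy as np

def prob_lms_step(mu_prev, s2_prev, x, y, sigma2_n, sigma2_d=0.0):
    """One update of the probabilistic LMS recursion.

    mu_prev : previous mean estimate (length-M vector)
    s2_prev : previous scalar posterior variance
    Returns the new mean, the new scalar variance and the step size eta.
    """
    c = s2_prev + sigma2_d                      # predictive variance, from (6)
    xx = x @ x                                  # ||x_k||^2
    eta = c / (c * xx + sigma2_n)               # scalar step size
    mu = mu_prev + eta * (y - x @ mu_prev) * x  # mean update, (7) / (11)
    s2 = (1.0 - eta * xx / x.size) * c          # variance update, (10)
    return mu, s2, eta

# Illustrative call with made-up numbers.
rng = np.random.default_rng(1)
M = 5
x = rng.standard_normal(M)
mu0, s2_0 = np.zeros(M), 1.0
mu1, s2_1, eta1 = prob_lms_step(mu0, s2_0, x, y=0.7,
                                sigma2_n=0.01, sigma2_d=1e-3)

# Cross-check (10) against the explicit trace in (9). The M x M matrix is
# built here only for verification; the recursion itself never forms it.
c = s2_0 + 1e-3
Sigma_full = (np.eye(M) - eta1 * np.outer(x, x)) * c
s2_from_trace = np.trace(Sigma_full) / M
```

Each step costs a handful of inner products of length M, which is where the linear complexity comes from.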

At this point, several interesting remarks can be made:

• The adaptive rule (11) has linear complexity since it
does not require us to compute the full matrix $\boldsymbol{\Sigma}_k$.

• For a stationary model, we have $\sigma_d^2 = 0$ in (10) and (11).
In this case, the algorithm remains valid and both the
step size and the error variance, $\hat{\sigma}_k^2$, vanish over time $k$.

• Finally, the proposed adaptable step-size LMS has only
two parameters, $\sigma_d^2$ and $\sigma_n^2$ (and only one, $\sigma_n^2$, in
stationary scenarios), in contrast to other variable step-size
algorithms [13, 14, 15]. More interestingly, both $\sigma_d^2$
and $\sigma_n^2$ have a clear underlying physical meaning, and
they can be estimated in many cases. We will comment
more about this in the next section.

5. EXPERIMENTS

We evaluate the performance of the proposed algorithm in
both stationary and tracking experiments. In the first experiment,
we estimate a fixed vector $\mathbf{w}^o$ of dimension $M = 50$.
The entries of the vector are independently and uniformly
chosen in the range $[-1, 1]$. Then, the vector is normalized
so that $\|\mathbf{w}^o\| = 1$. Regressors $\mathbf{x}_k$ are zero-mean Gaussian
vectors with identity covariance matrix. The additive noise
variance is such that the SNR is 20 dB. We compare our algorithm
with standard RLS and three other LMS-based algorithms:
LMS, NLMS [17], VSS-LMS [18].¹ The probabilistic
LMS algorithm in [10] is not simulated because it is not
suitable for stationary environments.

¹ The parameters used for each algorithm are: for RLS $\lambda = 1$, $\epsilon^{-1} =
0.01$; for LMS $\mu = 0.01$; for NLMS $\mu = 0.5$; and for VSS-LMS $\mu_{\max} =
1$, $\alpha = 0.95$, $C = 10^{-4}$.

In stationary environments, the proposed algorithm has
only one parameter, $\sigma_n^2$. We simulate both the scenario where
we have perfect knowledge of the amount of noise (probLMS1)
and the case where the value of $\sigma_n^2$ is 100 times smaller
than the actual value (probLMS2). The Mean-Square Deviation
(MSD $= E\|\mathbf{w}^o - \mathbf{w}_k\|^2$), averaged over 50 independent
simulations, is presented in Fig. 1.

Fig. 1. Performance in terms of MSD of probabilistic LMS
with both optimal (probLMS1) and suboptimal (probLMS2) $\sigma_n^2$,
compared to LMS, NLMS, VSS-LMS, and RLS.

The performance of probabilistic LMS is close to RLS
(obviously at a much lower computational cost) and largely
outperforms previous variable step-size LMS algorithms proposed
in the literature. Note that, when the model is stationary,
i.e. $\sigma_d^2 = 0$ in (2), both the uncertainty $\hat{\sigma}_k^2$ and the adaptive
step size $\eta_k$ vanish over time. This implies that the error
tends to zero when $k$ goes to infinity. Fig. 1 also shows that
the proposed approach is not very sensitive to a bad choice
of its only parameter, as demonstrated by the good results of
probLMS2, which uses a $\sigma_n^2$ that is 100 times smaller than the
optimal value.
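The stationary setup can be reproduced in miniature. The sketch below is a single-run approximation under stated assumptions, not the 50-run average reported in the paper: the initial variance `s2 = 1.0`, the run length of 2000 iterations and the random seed are choices made here for illustration, and only the proposed recursion (with $\sigma_d^2 = 0$ and the true $\sigma_n^2$) and fixed-step LMS with step size 0.01 are compared.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n_iter = 50, 2000

# Fixed target vector with uniform entries, normalized to unit norm.
w_o = rng.uniform(-1.0, 1.0, M)
w_o /= np.linalg.norm(w_o)

# Unit signal power (||w_o|| = 1, identity regressor covariance) at 20 dB SNR.
sigma2_n = 10.0 ** (-20.0 / 10.0)

w_prob, s2 = np.zeros(M), 1.0     # s2 = 1.0 is an assumed initial variance
w_lms, mu_lms = np.zeros(M), 0.01
msd_prob = np.empty(n_iter)
msd_lms = np.empty(n_iter)

for k in range(n_iter):
    x = rng.standard_normal(M)
    y = x @ w_o + np.sqrt(sigma2_n) * rng.standard_normal()

    # Probabilistic LMS, stationary case (sigma_d^2 = 0).
    xx = x @ x
    eta = s2 / (s2 * xx + sigma2_n)
    w_prob = w_prob + eta * (y - x @ w_prob) * x
    s2 = (1.0 - eta * xx / M) * s2

    # Standard LMS with a fixed step size.
    w_lms = w_lms + mu_lms * (y - x @ w_lms) * x

    msd_prob[k] = np.sum((w_o - w_prob) ** 2)
    msd_lms[k] = np.sum((w_o - w_lms) ** 2)

# Average the squared deviation over the last 200 iterations of the run.
msd_prob_db = 10.0 * np.log10(msd_prob[-200:].mean())
msd_lms_db = 10.0 * np.log10(msd_lms[-200:].mean())
print(f"probLMS: {msd_prob_db:.1f} dB, LMS: {msd_lms_db:.1f} dB")
```

Because the step size of the proposed recursion shrinks as the variance estimate shrinks, its deviation keeps decreasing after fixed-step LMS has reached its steady-state floor.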



Fig. 2. Real part of one coefficient of the measured and estimated
channel in experiment two. The shaded area represents
two standard deviations from the prediction (the mean of the
posterior distribution).


















| Method   | LMS    | NLMS   | LMS-2013 | VSSNLMS | probLMS | RLS    |
| MSD (dB) | -28.45 | -21.07 | -14.36   | -26.90  | -28.36  | -25.97 |


Table 1. Steady-state MSD of the different algorithms for the 
tracking of a real MISO channel. 


In a second experiment, we test the tracking capabilities
of the proposed algorithm with real data of a wireless MISO
channel acquired in a realistic indoor scenario. More details
on the setup can be found in [19]. Fig. 2 shows the real part
of one of the channels, and the estimate of the proposed algorithm.
The shaded area represents the estimated uncertainty
for each prediction, i.e. $\hat{\mu}_k \pm 2\hat{\sigma}_k$. Since the experimental
setup does not allow us to obtain the optimal values for the
parameters, we fix these parameters to the values that optimize
the steady-state mean square deviation (MSD). Table 1 shows
this steady-state MSD of the estimate of the MISO channel
with different methods. As can be seen, the best tracking
performance is obtained by standard LMS and the proposed
method.
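The measured MISO channel is not available here, so the uncertainty band can only be illustrated on synthetic data. The sketch below is a hypothetical stand-in: it tracks a parameter vector that follows the random walk of (2) and counts how often one true coefficient lies within two estimated standard deviations of its estimate. The dimension, variances, prior variance `s2_init` and seed are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n_iter = 5, 5000
sigma2_n, sigma2_d = 1e-2, 1e-4   # illustrative noise and diffusion variances
s2_init = 1.0                     # assumed prior variance

w_true = np.sqrt(s2_init) * rng.standard_normal(M)
w_hat, s2 = np.zeros(M), s2_init
inside = np.empty(n_iter, dtype=bool)

for k in range(n_iter):
    # Random-walk evolution of the true parameters, as in (2).
    w_true = w_true + np.sqrt(sigma2_d) * rng.standard_normal(M)
    x = rng.standard_normal(M)
    y = x @ w_true + np.sqrt(sigma2_n) * rng.standard_normal()

    # Probabilistic LMS update with a non-zero diffusion variance.
    c = s2 + sigma2_d
    xx = x @ x
    eta = c / (c * xx + sigma2_n)
    w_hat = w_hat + eta * (y - x @ w_hat) * x
    s2 = (1.0 - eta * xx / M) * c

    # Is the first true coefficient inside the +/- 2 sigma band?
    inside[k] = abs(w_true[0] - w_hat[0]) <= 2.0 * np.sqrt(s2)

coverage = inside.mean()
print(f"fraction inside the band: {coverage:.3f}")
```

With $\sigma_d^2 > 0$ the variance estimate settles at a non-zero level instead of vanishing, so the step size stays large enough to follow the drifting parameters.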

6. CONCLUSIONS AND OPEN EXTENSIONS

We have presented a probabilistic interpretation of the
least-mean-square filter. The resulting algorithm is an adaptable
step-size LMS that performs well both in stationary and tracking
scenarios. Moreover, it has fewer free parameters than
previous approaches and these parameters have a clear physical
meaning. Finally, as stated in the introduction, one of the
advantages of having a probabilistic model is that it is easily
extensible:

• If, instead of using an isotropic Gaussian distribution in
the approximation, we used a Gaussian with diagonal
covariance matrix, we would obtain a similar algorithm
with different step sizes and measures of uncertainty
for each component of $\mathbf{w}_k$. Although this model can be
more descriptive, it needs more parameters to be tuned,
and the parallelism with LMS vanishes.

• Similarly, if we substitute the transition model of (2) by
an Ornstein-Uhlenbeck process,

$$p(\mathbf{w}_k \mid \mathbf{w}_{k-1}) = \mathcal{N}(\mathbf{w}_k;\, \lambda \mathbf{w}_{k-1},\, \sigma_d^2 \mathbf{I}),$$

a similar algorithm is obtained but with a forgetting factor
$\lambda$ multiplying $\mathbf{w}_{k-1}^{(\mathrm{LMS})}$ in (11). This algorithm may
have improved performance under this kind of
autoregressive dynamics of $\mathbf{w}_k$, though, again, the connection
with standard LMS becomes dimmer.


• A similar approximation technique could be applied
to more complex dynamical models, i.e. switching
dynamical models [20]. The derivation of efficient
adaptive algorithms that explicitly take into account a
switch in the dynamics of the parameters of interest is
a non-trivial and open problem, though the proposed
approach could be useful.

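• As in [?], the measurement model (1) can be changed
to obtain similar adaptive algorithms for classification,
ordinal regression, and Dirichlet regression for compositional
data.

• Finally, like standard LMS, this algorithm can be kernelized
for its application in estimation under non-linear
scenarios.

A. KL DIVERGENCE BETWEEN A GENERAL
GAUSSIAN DISTRIBUTION AND AN ISOTROPIC
GAUSSIAN

We want to approximate $p_{\mathbf{x}_1}(\mathbf{x}) = \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_1,\, \boldsymbol{\Sigma}_1)$ by $p_{\mathbf{x}_2}(\mathbf{x}) =
\mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_2,\, \sigma_2^2 \mathbf{I})$. In order to do so, we have to compute the
parameters of $p_{\mathbf{x}_2}(\mathbf{x})$, $\boldsymbol{\mu}_2$ and $\sigma_2^2$, that minimize the following
Kullback-Leibler divergence,

\begin{align}
D_{KL}(p_{\mathbf{x}_1} \| p_{\mathbf{x}_2}) &= \int_{-\infty}^{\infty} p_{\mathbf{x}_1}(\mathbf{x}) \ln \frac{p_{\mathbf{x}_1}(\mathbf{x})}{p_{\mathbf{x}_2}(\mathbf{x})}\, d\mathbf{x} \nonumber \\
&= \frac{1}{2} \left\{ -M + \mathrm{Tr}\left(\sigma_2^{-2} \mathbf{I} \cdot \boldsymbol{\Sigma}_1\right) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T \sigma_2^{-2} \mathbf{I}\, (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) + \ln \frac{\sigma_2^{2M}}{\det \boldsymbol{\Sigma}_1} \right\}. \tag{12}
\end{align}

Using symmetry arguments, we obtain

$$\boldsymbol{\mu}_2^* = \arg\min_{\boldsymbol{\mu}_2} \left\{ D_{KL}(p_{\mathbf{x}_1} \| p_{\mathbf{x}_2}) \right\} = \boldsymbol{\mu}_1. \tag{13}$$

Then, (12) gets simplified into

$$D_{KL}(p_{\mathbf{x}_1} \| p_{\mathbf{x}_2}) = \frac{1}{2} \left\{ -M + \mathrm{Tr}\left(\frac{\boldsymbol{\Sigma}_1}{\sigma_2^2}\right) + \ln \frac{\sigma_2^{2M}}{\det \boldsymbol{\Sigma}_1} \right\}. \tag{14}$$

The variance $\sigma_2^2$ is computed in order to minimize this
Kullback-Leibler divergence as

\begin{align}
\sigma_2^{2*} &= \arg\min_{\sigma_2^2} D_{KL}(p_{\mathbf{x}_1} \| p_{\mathbf{x}_2}) \nonumber \\
&= \arg\min_{\sigma_2^2} \left\{ \sigma_2^{-2}\, \mathrm{Tr}\{\boldsymbol{\Sigma}_1\} + M \ln \sigma_2^2 \right\}. \tag{15}
\end{align}

Differentiating and setting the derivative equal to zero leads to

$$\frac{\partial}{\partial \sigma_2^2} \left[ \frac{\mathrm{Tr}\{\boldsymbol{\Sigma}_1\}}{\sigma_2^2} + M \ln \sigma_2^2 \right] = \left. \frac{M}{\sigma_2^2} - \frac{\mathrm{Tr}\{\boldsymbol{\Sigma}_1\}}{(\sigma_2^2)^2} \right|_{\sigma_2^2 = \sigma_2^{2*}} = 0.$$

Finally, since the divergence has a single extremum in $\mathbb{R}_+$,

$$\sigma_2^{2*} = \frac{\mathrm{Tr}\{\boldsymbol{\Sigma}_1\}}{M}. \tag{16}$$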
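As a supplementary check that is not part of the original derivation, the second derivative confirms that the stationary point in (16) is a minimum. The shorthand $s = \sigma_2^2$ and $T = \mathrm{Tr}\{\boldsymbol{\Sigma}_1\} > 0$ is introduced here only for brevity:

```latex
\begin{align}
f(s) &= \frac{T}{s} + M \ln s, \\
f'(s) &= -\frac{T}{s^2} + \frac{M}{s},
  \qquad f'(s^*) = 0 \;\Rightarrow\; s^* = \frac{T}{M}, \\
f''(s^*) &= \frac{2T}{(s^*)^3} - \frac{M}{(s^*)^2}
  = \frac{2M^3}{T^2} - \frac{M^3}{T^2}
  = \frac{M^3}{T^2} > 0 .
\end{align}
```

Since $f''(s^*)$ is strictly positive for any positive-definite $\boldsymbol{\Sigma}_1$, the single extremum on $\mathbb{R}_+$ is the minimizer of the divergence.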